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Abstract. Solvent-mediated hydrodynamic interactions between colloidal par- 
ticles can significantly alter their dynamics. We discuss the implementation 
of Stokesian dynamics in leading approximation for streaming processors as 
provided by the compute unified device architecture (CUDA) of recent graph- 
ics processors (GPUs). Thereby, the simulation of explicit solvent particles is 
avoided and hydrodynamic interactions can easily be accounted for in already 
available, highly accelerated molecular dynamics simulations. Special empha- 
sis is put on efficient memory access and numerical stability. The algorithm is 
applied to the periodic sedimentation of a cluster of four suspended particles. 
Finally, we investigate the runtime performance of generic memory access pat- 
terns of complexity 0{N^) for various GPU algorithms relying on either hard- 
ware cache or shared memory. 



1 Introduction 

Micron-sized colloidal particles are suspended in a solvent for most applications. The parti- 
cle motion couples to the flow field of the solvent, which can significantly affect the dynamic 
properties. Already the Brownian motion of a single particle in a viscous solvent creates a 
slowly decaying flow pattern with feedback on the particle at a later point in time. This eff'ect 
is known as hydrodynamic memory and leads to an experimentally relevant "coloured" noise 
spectrum The flow field due to a steadily dragged particle, first calculated by George 
G. Stokes, decays slowly with distance and exerts a drag on nearby particles referred to as 
hydrodynamic interactions. They are intrinsically of many-body nature and in general pro- 
duce chaotic trajectories with nontrivial and sometimes surprising effects. For example, three 
colloids driven along a circle exhibit an intriguing cyclic motion |2| due to the "peloton 
eff'ect": the mobility of a pair of close colloids is enhanced. Another example are sediment- 
ing suspensions in a slit pore, where the formation of complex spatial patterns, related to 
a Rayleigh-Taylor instability, was observed in experiments and simulations |3, 4|. Further, 
hydrodynamic interactions enhance the self-diffusion of colloidal particles confined to a fluid 
interface H and, in the absence of long-range repulsive forces, accelerate their capillary 
collapse f6l. 

Computer simulations of colloidal suspensions including hydrodynamic eff'ects are chal- 
lenged by the vast amount of solvent molecules per colloid particle. Various techniques have 
been developed which use coarse-grained solvents to approximate the fluid properties at 
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mesoscopic scales. Popular methods are the lattice-Boltzmann technique and multi-particle 
collision dynamics, see, e.g., Refs. 7 -9 and references therein. In particular for diluted sus- 
pensions, such approaches have the drawback that the fluid representation consumes most 
of the computational resources. Complementary to methods with explicit solvent are Stoke- 
sian dynamics simulations |FTO) TT|, where the solvent-mediated interactions are accounted 
for by a many-particle mobility matrix constructed from hydrodynamic tensors leading to 
deterministic equations of motion. The correct incorporation of Brownian motion |10| is 
computationally expensive, and often approximate algorithms are used [12| . Here, we re- 
strict to phenomena where Brownian motion is of relatively low importance compared to the 
overall motion (large Peclet number) and can be neglected, examples can be found in Refs.|6] 
and[l3-15 

Computing hardware has seen a paradigm shift during the last decade from single-core 
processors to highly parallel multi-core architectures. Driven by the consumer market for 
video games, graphics processing units (GPUs) have been developed that contain several 
hundred tiny processor cores on a single chip. These subunits are specialised on streaming 
numerical computations of large data sets in parallel. Today, such streaming processors are 
often part of new installations in high-performance computing centres. 

The success of conventional molecular dynamics simulations on GPUs fTSl suggests to 
exploit the potential of GPUs also for simulation techniques with more complex interactions 
and advanced algorithms. In general, particle-based continuum simulations are more chal- 
lenging than lattice models with respect to algorithms and runtime performance IJTl [TSlI as 
well as numerical stability and floating-point precision |fT9l l20ll . Here, we will specifically 
address simulations of suspended particles with direct hydrodynamic interactions instanta- 
neously mediated by an implicit solvent. GPUs were used previously to simulate hydrody- 
namic interactions in suspensions of active dumbbells 1 15 1. A hybrid implementation scheme 
with explicit solvent is described in Ref 21 where the particles are treated by the host pro- 
cessor(s) and the solvent is modelled as lattice-Boltzmann fluid and propagated by the GPU. 



2 Theoretical background 
2.1 stokes equation 

The starting point for a mesoscopic description of the solvent is the Navier-Stokes equation 
for incompressible flow (low Mach number) of Newtonian fluids f22\. 

e(5,u + u- Vu) = 77V2u-V/?H-f, V-u = 0, (1) 

u - u(r, t) denotes the velocity field of the fluid, g the uniform mass density, rj the shear 
viscosity, p - p(r, t) the pressure field, and f = f(r, t) an external force density. Such a 
continuum approach holds at length and time scales much larger than the molecular scales of 
the solvent (low Knudsen number). The assumption of incompressibility is quite well fulfilled 
by most fluids at standard temperature and pressure. 

Suspended colloidal particles impose additional boundary conditions on the fluid. For 
not too small particles, there is ample experimental evidence that the fluid molecules at the 
particle boundary adopt the velocity of the latter, which is referred to as stick boundary 
conditions, 

u(r) = V -H n X (r - ro) for re dV. (2) 

Here, v and Q are the translational and angular velocities of the immersed object, respec- 
tively; To is a reference point of the particle and dV denotes its surface. 

The physics of micron-sized objects like bacteria or colloidal particles is mostly domi- 
nated by viscous friction (low Reynolds number), which permits neglecting the inertia term 

■ Vu in Eq. ([T]i, known as creeping flow limit. The time scale t for the velocity field 



Will be inserted by the editor 



3 



to adjust to a local disturbance is set by the diffusive propagation of shear waves together 
with the largest typical length scale L of the problem Il23l l24ll : for water and L - 10 [xm, 
T a; gL^/rj x 10^* s. For sufficiently slowly changing boundary conditions, the fluid response 
can be considered instantaneous and the time derivative gd,u can be dropped as well, which 
leads to the time-independent Stokes equation, 

-/yV^u^ -Vp+f, V-u = 0, (3) 

which forms the basis of Stokesian dynamics simulations. It has some analogies to the Pois- 
son equation in magnetostatics, from which many concepts can be carried over The approx- 
imation of infinitely fast propagating hydrodynamic interactions is a serious limitation of 
Stokesian dynamics and needs to be justified for every problem, in particular when consider- 
ing large system sizes or an externally prescribed time -dependence of the boundaries ll24ll . 



2.2 Hydrodynamic interactions 

The flow field u(r) = G(r) ■ F due to a point force, f (r) = F <5(r), acting on the fluid defines 
Green's function of the Stokes equation, the Oseen tensor Ii25il . 

G(r)^ -^(l + rr); (4) 

the notation ab indicates a second-rank tensor with components (ab)^/} - Oab/s, lap = Sa/i is 
the unit tensor, and f = r/r with r - |r|. The most prominent property of the Oseen tensor 
is the slow 1/r-decay of its magnitude reflecting the long-range nature of hydrodynamic 
interactions. 

The boundary condition, Eq. (|2]), for the suspended particles can be replaced by an (a 
priori unknown) induced force density f(r) localised on the particle surfaces, continuing the 
velocity field inside the particles. By the Stokes equation (|3]l, f(r) generates a flow field u(r), 
which in general depends on the shapes and orientations of the particles. The contribution 
U()(r) from a single particle in a quiescent fluid is obtained within a multipole expansion 
analogously to the treatment in magnetostatics |26j. Keeping only the leading, monopole 
contribution, it reads 

Uo(r) = I d^r' G(r - r') ■ f(r') = (l + ^V^] G(r) ■ F + Oir''^) for r » a, (5) 



with total force F - Jf(r)d^r, (root-mean-square) radius of the particle surface a, and the 
particle being centred at the coordinate origin. An external drag force can be included in the 
force monopole, and the flow field for many particles is found by superposition of the single 
particle contributions. 

The velocity distribution on the surface of another particle at position r may be expanded 
in multipoles as well, the monopole term yields just the particle's translational velocity v; its 
computation is facilitated by Faxen's theorem Ii23l . 

/ ^2 \ 

(6) 



The velocity dipole determines the angular velocity, and higher terms vanish for stick bound- 
ary conditions. Rotational motion leads to a coupling between velocity and force monopoles, 
which is of order r^* and, at the monopole level, limits the consistent expansion of Eq. (|5]l to 
the third order in distance. 
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Combining Eqs. (|5]l and (|6]l for particles encodes the hydrodynamic interactions in a 
configuration-dependent many-body mobiUty matrix JU,^(ri, . . . , r^v) relating forces to veloc- 
ities, 

N N 

v, = ^ju,.^.-F,=;UoF, + ^T(r,-r,)-F,- for i^U...,N; (7a) 

Latin subscripts refer to particle indices. The self-contribution ju„ describes the direct effect 
of a drag force on the particle itself and is set by the Stokes mobility, /io = \l6nria. The 
off-diagonal terms are given by the Rotne-Prager tensor, 

T(r) = + y V^j G(r) = j(,o^(/+ +//o|^(/- 3ff) . (7b) 

If the particle size is negligible compared to the particle separation, the gradients in Eqs. (|5]l 
and (|6| may be dropped, replacing 7 by the Oseen tensor G. An advantage of the Rotne- 
Prager tensor, necessary for the simulation of Brownian motion, is that the matrix ju is positive 
definite if no pair of particle centres is closer than 2a ll27l . 



2.3 Lubrication correction 

If two particles come close together, the hydrodynamic coupling is only poorly described 
by the far-field expansion. In particular, the many-body friction matrix, ^ - fi^^, becomes 
singular if only a thin lubrication film separates the particle surfaces. Cox ["281 computed 
the leading singularities for two approaching surfaces of arbitrary shape, and asymptotic 
expansions for two spheres were given by Jeffrey and Onishi Il29l . Let us introduce relative 
positions and velocities, r,y = r, - and v,j - v, - Vy, and the dimensionless separation 
between particle surfaces, Sjj - rij/a - 2. Then, the singular part of the force on sphere / due 
to a sphere j moving nearby reads l29l 

f,Arv.-^og(.,)(/-f,f,).v, + 0(4), (8) 

again ignoring contributions from rotational motion. We truncate ad hoc at Sij = 1 or r, j - 3a, 
where the logarithm changes its sign, avoiding a sign change of the lubrication forces in 
the present approximation. The lubrication correction to the many-body friction matrix is 
constructed from close particle pairs alone, such that F""° - 2j^-"'^Vj. The result from 
lubrication theory has to be matched with the far-field expansion of the inverse mobility 
matrix ifTTllSOll . Since we kept only the singular terms of ^ in Eq. (|8]l it can simply be added, 

V = (ju-' + F = ju (/ + F. (9) 

3 Simulation algorithm and implementation 
3.1 Algorithm 

The idea behind Stokesian dynamics simulations is to avoid the explicit simulation of the 
individual solvent molecules as one would do in traditional molecular dynamics simulations. 
Rather, hydrodynamic interactions between colloidal particles are properly incorporated by 
means of the many-body mobility matrix jU defined in Eq. (|7|. This matrix is entirely con- 
structed from pair contributions, even at higher orders of the multipole expansion. 
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Our main motivation for including the lubrication correction is the implementation of a 
hard core repulsion. Hence, we suggest the approximation v ju(/-^'*'"^p)Fto Eq. which 
avoids the storage and inversion of the 3A^ x 3A^ matrix ju. Further, we approximate on the 
right-hand side ju^'*'"°juF » p^^'"^v using the velocities from the previous simulation step; 
effectively, the particle forces are corrected for lubrication before the velocity update. Then, 
the simulation algorithm consists of repeating the following steps; 

1. update particle positions: 
r,' «— r,- + 6t V,- 

2. compute forces on each particle due to external forces, interactions, and lubrication: 

Fi ^ F-'(r/) + Z'j Fr(r/, r^) - Z'j C^'O^^ ^j) ' (v/ - v>) 

3. compute particle velocities via mobility matrix: 
V/ ^ Y,jiiij(ri,rj)-Fj 

Particle positions are propagated by a simple Euler integration in step 1 . The primed sums 
in step 2 can be accelerated by Verlet neighbour lists common for short-ranged forces as 
described, e.g., in Ref. 19 Since the hydrodynamic interactions are long-ranged, like electro- 
statics or gravity, it is preferable to keep all interactions of every pair of particles and not to 
cut off at some short distance[^Thus, the sum in step 3 is the computationally most intense 
part involving A^^ contributions for all particles; the mobility matrix itself, however, need 
not be stored in memory. 

3.2 GPU implementation 

The present work relies on the "compute unified device architecture" (CUDA) introduced by 
the Nvidia Corporation |31|. It provides some abstraction to the actual GPU hardware and 
makes Nvidia GPUs freely programmable using the C++ language. The central concept is 
thread parallelism, which means that several hundred or thousand program threads execute 
the same routine on the GPU, the latter is referred to as "kernel". Each group of (currently) 
M = 32 threads forms a "warp", several warps are organised into a thread block, and all 
blocks together represent the full set of threads. The threads within a warp execute simulta- 
neously, but the execution order of warps is not guaranteed. The warps comprising a block 
are resident on the same multiprocessor and execute independently, the size of a block is lim- 
ited by the available hardware resources. Communication between threads is only supported 
within the same block using "shared" memory, which does not persist beyond the lifetime of 
the block. Data input and output of a kernel occurs through the "global" memory, which is 
the only persistent type of memory on the GPU board. Despite the high overall bandwidth 
of global memory, each memory transaction entails a huge latency of several hundred clock 
cycles, which is, however, partly hidden by the thread scheduler on the chip. Optimal ac- 
cess to global memory is ensured if a warp of threads reads from or writes to a sequence of 
contiguous memory locations that fit within a single, properly aligned window of 64 or 128 
bytes. Then, the memory access of the whole warp is coalesced into a single transaction of 
a large block. As another remedy for the high memory latency, the Nvidia GPUs of CUDA 
compute capability > 2.0 ("Fermi") have been equipped with hardware caches. We put par- 
ticular attention to memory transfer because it is typically the bottleneck of a GPU program; 
it appears to become the limiting factor in conventional high-performance computing as well. 
We are not concerned with memory transfer between host system and GPU device since our 
simulation algorithms are fully implemented for the GPU. 

' We use a simulation box with periodic boundaries and include only hydrodynamic interactions 
from the nearest image of each particle. Then, the finite size of the box effectively imposes a cutoff at 
large distances. Having applications to colloidal dispersions in mind, the long-ranged interaction with 
all periodic images is ignored since the periodicity has no physical grounds and is as artificial as a sharp 
cutoff. 
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Algorithm 1 Pseudo code of the CUDA kernel for updating the particle velocities in the 
naive implementation, for an illustration see Fig.[T^. The kernel is executed concurrently by 
at least threads, array indices are 1 -based, and arrays in global memory are denoted by 
capital letters. 

function uPDATE_VELOciTY(position array R, velocity array V, force array F, #particles N, . . .) 
local j «— global thread ID 

local r «— Rj > read from global memory 

local V <— 

for k from 1 to W do > loop over all particles 

local r', f ^ R{., > read from global memory 

it j = k then > velocity contribution according to Eq. (|7j 

V <— V + /io f- > self-contribution 
else 

V <— V + T(r - r') • f. > use hydrodynamic tensor 
end if 

end for 

V; <— V > write to global memory 

end function 



The most straightforward implementation of step 3 on a GPU is to assign one thread to 
each particle, which computes and accumulates the contributions from all particles (in- 
cluding the self-contribution) to the velocity of the thread's particle, see Algorithm [T] The 
symmetry fij^. - f/jj^ will not be made use of for the same reason why Newton's third law 
is not exploited for the computation of forces from pair interactions: transferring the result 
from thread j to an arbitrary thread k would break thread parallelism and would require a 
sophisticated communication pattern involving a bloat of global memory transactions. The 
kernel basically consists of a loop over all particles, sequentially reading and processing their 
positions and forces from global memory. Before exit, the kernel stores the computed veloc- 
ity, which is the only global write operation. The threads within a warp can diverge (line [7| 
due to the self-contribution of a particle to its velocity, which unavoidably happens once for 
every thread in a warp and does not seem to significantly impact the execution time. 

Following Nyland et al. |321, the naive read pattern in line [6] is excessively inefficient: 
first, the same location in memory is accessed by every thread (Fig.[T^) and, second, memory 
bandwidth is wasted. The requirements for optimal memory access can be met by tiling the 
matrix of N x N interactions into squares of size Nb Il32ll . see Fig.[T]3; a similar approach 
has been proposed for general matrix multiplication ||3TI . Nb denotes the number of threads 
per block, it varies typically between 128 and 1024. The interactions are processed tile by 
tile and the data for a tile are cached in shared memory, making them available to all threads 
within the block (Fig. [T]J), and reducing the number of global memory reads by a factor 
of Nb/M. For gravity |32| or matrix multiplication |31 1, the size of a data item is smaller 
than in our case of hydrodynamic interactions. Here, the information per particle comprise 
two three-dimensional vectors, a position and a force, amounting to 32 bytes per particle 
(2 X float4) if alignment restrictions are obeyed. The relatively high shared memory usage 
of 32 X M = 1024 bytes per warp limits the multiprocessor occupancy on older hardware 
(CUDA compute capability < 2.0) with a potential increase of kernel execution time; for 
compute capability 2.0, maximum occupancy is exhausted by 1 kb of shared memory per 
warp |3T|. The shared memory usage can be significantly reduced without increasing the 
volume of global memory transfer if merely the first warp of a block fetches the data, the 
tiles become of rectangular shape M x Nb- The resulting small-tiling version is displayed in 
Algorithm|2]and illustrated in Fig.[T];. The original tiling algorithm is recovered by replacing 
M with the block size and removing the condition in line [9] 
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Fig. 1. Visualisation of various algorithms for the computation of hydrodynamic interactions, see Al- 
gorithms[T|and[2] Blue horizontal arrows represent the threads being associated with one particle each, 
gradually computing hydrodynamic interactions with all N particles (purple boxes on top); only one 
block of Nij threads is shown. Purple, vertical arrows represent global memory accesses; green rect- 
angles symbolise data fetching and pink ones data processing, i.e., the actual computation, (a) naive 
implementation. Data of one particle at a time are read from global memory directly without caching, 
(b) tiling algorithm. Data are read from global memory by coalesced transactions and are stored block- 
wise in shared memory for further processing, (c) small-tiling algorithm using small tiles of size M, it 
is similar to the tiling algorithm, but the coalesced read is performed only by one warp of each thread 
block (delimited by the dashed line), (d) access to shared memory (red arrows/box) within one thread 
block for the tiling (M = A'j) and small-tiling algorithms. 



Data fetching from global memory uses adjacent memory locations and allows for coa- 
lescable access (line[TO]l, in contrast to line[6]of Algorithm[T] It is followed by a synchronisa- 
tion barrier (line[T2| to prevent the other warps from premature reading from shared memory. 
The condition in line|9]does not break convergent thread execution within each warp (the con- 
dition is fulfilled either by each thread of a warp or by none), and only memory access within 
a warp is eligible for coalesced transactions. We see only the slight disadvantage with respect 
to execution time that fewer transactions are pending on a multiprocessor, which impedes la- 
tency hiding by the scheduler After data fetching is completed, all threads of the block read 
from shared memory and compute the velocity contribution from hydrodynamic interactions 
for their respective particle. The tile is closed by another synchronisation barrier (line [2T] i 
signalling that the cached data may be overwritten. 

The velocity sum in Eq. (|7]) contains contributions of very different magnitude since the 
terms decay like the inverse particle separation, 1 /r,y. Conveying the experience from con- 
ventional MD simulations 1 19], the numerical error can significantly be reduced if the accu- 
mulation in line[T8]of Algorithm[2]is performed with higher floating-point precision, whereas 
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Algorithm 2 The small-tiling variant of algorithm [T] Position and force data are partitioned 
in small tiles and prefetched into shared memory by the first warp of each block, for an 
illustration see Fig.[T];,d. Arrays in shared memory are denoted by a tilde. 



1 
i 


function uPDATE_VELOciTY(position array R, velocity array V, force array F, #particles N, . . .) 


2 


local j «— global thread ID 




3 


local n <— thread ID within block 




4 


local M «— warp size 




5 


allocate shared memory i^[M],f[M] 


> positions, forces 


6 


local r «— Rj 


> read from global memory 


7 


local V «— 




8 


for m from 1 to W with stride M do 


> loop over all particles in tiles of size M 


9 


if thread in first warp tiien 




10 


r»,f/i * R/)(+;() F;„+„ 


> copy from global to shared memory 


11 


end if 




12 


synchronise threads 


> wait until data are fetched 


13 


for k from 1 to M do 


> loop over fetched data 


14 


local r',f' <— fi-,fj. 


> read from shared memory 


15 


if / = m + k then 


> velocity contribution according to Eq. |7j 


16 


V <— V +/Jof'- 


> self-contribution 


17 


else 




18 


V <- v+ G(r-r') -f. 


> use hydrodynamic tensor 


19 


end if 




20 


end for 




21 


synchronise threads 


> wait until data are processed 


22 


end for 




23 




> write to global memory 


24 


end function 





the hydrodynamic tensor and the forces are computed in single, 24-bit precision. Still reading 
particle positions and forces in single precision keeps down the memory transfer volume. For 
CUDA compute capability > 2.0, native double precision may be used, for older hardware, 
an emulation via double-single arithmetic is available and can even be more efficient lfT9l . 
Finally, since an all-pairs interaction does not involve selective reading, e.g., via Verlet neigh- 
bour lists, the physical positions of the particles do not affect the kernel execution time and 
particle reordering is not necessary. It is, however, required for the efficient computation of 
interparticle forces and the lubrication correction. 



3.3 Runtime performance 

The described scheme for the computation of hydrodynamic interactions. Algorithm |2] has 
been implemented as a module extending HAL's MD package |33|, which forms the basis 
for the following measurements of runtime performance. The benchmarks mimic ffie sim- 
ulation of a sedimenting suspension: all particles experience the same constant drag force 
F - (0, 0, -F) due to their buoyant weight, which together with the particle radius a and the 
Stokes mobility /io defines the unit of time, Tj - ajuoF. The particles were initially placed 
on an fee lattice with a fixed number density of n - N jLr' - 0.1a"'', and the number of 
particles A'^ was varied between 150 and 200,000. Particle positions were propagated by ffie 
Euler integrator using a timestep of O.OOIt,, and simulations were run for 100 steps. The 
GPU hardware for the benchmarks was an Nvidia Tesla C2050, the program was compiled 
with the CUDA software development kit (SDK) version 4.0 using the option -arch sm_1 3, 
and kernels were launched with 512 threads per block. Note that the simulations were about 
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Number of particles Number of particles 



Fig. 2. (a) Simulation runtime per step for the computation of hydrodynamic interactions using CPU 
and GPU implementations of Algorithm[2j straight lines are fits to the regimes of linear and quadratic 
scaling with particle number, (b) Relative contributions to the total GPU runtime from the computation 
of hydrodynamic interactions (hatched bars memory access only, without computation), of lubrication 
forces, and integration of the equations of motion; the chosen particle numbers correspond to the two 
scaling regimes and the crossover region. 



2.3-times slower on the same hardware if the compiler optimised for the actual target ar- 
chitecture with -arch sm_20, probably due to a different implementation of floating-point 
arithmetic I^For reference, we compare with an equivalent serial implementation for the CPU 
(Intel Xeon E5620 clocked at 2.40 GHz), with the only difference that the symmetry ju,^ - fip 
is exploited. 

The measured runtimes per simulation step are displayed in Fig. [2] One infers clearly that 
the runtime of the GPU implementation scales quadratically in for large system sizes, while 
it grows only linearly for N < 2, 000 particles. The former quadratic scaling is also found for 
the CPU version for all particle numbers, simply reflecting that the interaction between all 
particle pairs requires 0{N^) computations. In the scaling regime, the speedup factor of the 
GPU over the CPU is about 120. The linear scaling for not too large particle numbers origi- 
nates from a partial usage of GPU resources: doubling the number of particles, the execution 
time of each CUDA thread doubles, but so does the number of threads running concurrently. 
In summary, the total kernel runtime only doubles. When the device capacity is exhausted 
not all threads can run in parallel anymore, yielding the quadratic increase of the runtime 
with the number of particles. A small performance gain of 13% is achieved by computing the 
hydrodynamic interactions only to order (9(r i.e., in the Oseen approximation. 

In Section [J!2] we developed the small-tiling algorithm for efficient access to global mem- 
ory. Surprisingly, we found that a variant of the eventual kernel for the computation of hydro- 
dynamic interactions using the naive memory access pattern slightly outperforms the more 
sophisticated algorithm, the runtime in the scaling regime is reduced by 8%. This observa- 
tion motivated us to scrutinise the performance of the different memory access algorithms 
depending on the specific access pattern, see Appendix |A] For linear data access, the naive 
implementation performs increasingly better as the size of the individual data elements in- 
creases, and finally the overhead of data caching penalises the tiling algorithms. In the present 
case, both criteria are met: particle data are laid out linearly and 32 bytes are read for each 
interaction. 



^ The less IEEE-compliant floating-point arithmetic of CUDA compute capability 1.3 is approxi- 
mately restored by the additional CUDA compiler flags -ftz=true -prec-div=false -prec-sqrt=false. 
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a b 




Simulation time [Tj] 



Fig. 3. Periodic sedimentation of four equal spheres, (a) Initial configuration of the particles on the 
edges of a stretched tetrahedron, edges of equal length are symbolised by equal line patterns. Gravity 
acts downwards as indicated by the arrow, (b) Periodic trajectories for two selected coordinates of 
particles 1 and 2 and the vertical component of the centre of mass, Zcm, after correcting for the average 
sedimentation motion. 

4 Numerical stability: periodic sedimentation of four spheres 

In Stokesian dynamics simulations, there are no obvious conserved quantities like energy or 
momentum, but one may exploit a symmetry to examine the numerical stability of the simula- 
tion. The Stokes equation (|3]l is invariant under time reversal (u i-> -u) if positions and forces 
are reversed as well, r i-> -r, f i-> -f . The symmetry of the Stokes flow is preserved for fore- 
and-aft-symmetric particle configurations, i.e., for boundaries invariant under r i-> -r. These 
observations can be used to construct a cyclic motion in an external gravitational field 1341 : 
four equal particles are initially positioned at the vertices of a stretched tetrahedron with two 
opposing edges perpendicular to the force, see Fig. |3^. After a quarter period, all particles 
will be in a horizontal plane, and after sedimenting for another quarter period, the horizontally 
mirrored initial configuration is recovered. After another half period, the initial (relative) con- 
figuration is exactly restored. For the initial positions, we placed two horizontally aligned and 
vertically separated couples of particles at ri = (0, 0, 5a), r2 - (0, 5fl, 0), = -ri, r4 - -r2', 
each particle is dragged downwards by a force F - (0, 0, -F). The trajectories are deter- 
mined by a set of merely three different coordinates, e.g., xi(t), z\(t) and ziit), see Fig. [3j), 
the remaining components follow by symmetry: x^it) = -xi(f), ^2(0 = -y4(t) being equal to 
xi(t) shifted in phase by half a period, 13(1) - Zi(t), Z4(t) = Z2(t), and yi - y-^ - X2 - xn, = 0. 
The centre-of-mass position, Zcmif) - \z\{i) + zi{t)\j2, decreases monotonically, but oscillates 
around the average sedimentation motion. 

The cycle duration is about T - 517rj long, we have performed long simulations over 
almost 1000 cycles (50 million steps with an integration timestep of 0.0 Itj) in order to mon- 
itor the numerical stability. The maxima and minima of xx(t) were determined by parabolic 
fits yielding the cycle duration and the amplitude of the motion, the cycle duration was av- 
eraged over time windows of IO'^t,. We compare implementations which use either single 
or double-single floating-point precision 1 19 1 for the accumulation of (a) velocities due to 
hydrodynamic interactions, Eq. (|7]i, and (b) positions in the Euler integration of the equa- 
tions of motion. The contributions to the hydrodynamic interactions from different particles 
vary strongly for close and distant particles due to the 1/r prefactor; hence, large and small 
numbers of different sign are summed with great potential for numerical issues due to limited 
floating-point precision. 
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Fig. 4. Cycle duration and amplitude of the periodic motion over the course of a very long simulation 
run using different floating-point precision for the accumulation of hydrodynamic interactions (HI) 
and positions (integration of the equations of motion); the two datasets refer to different integration 
timesteps, St = O.OOITj and St = O.OIt, as indicated. Note that some curves collapse with the case 
"both double-single precisison". 



Over the course of the simulation, both the cycle duration and the amplitude display a 
drift independently of the floating-point precision, see Fig. |4] The motion, however, remains 
cyclic if double-single precision is used for the accumulation of hydrodynamic interactions. 
Otherwise, the periodic motion falls apart after some 400 cycles for the larger timestep, re- 
gardless of the precision used for the integration. Using the smaller timestep 6t = O.OOItj 
for the integration of positions, the drift is significantly reduced and the cyclic motion re- 
mains stable over the full simulation run of more than 500 cycles. The small timestep in 
conjunction with single precision for both the accumulation of hydrodynamic interactions 
and the integration leads to a wildly fluctuating cycle duration and amplitude, probably be- 
cause the low floating-point precision fails to properly resolve the small increments. Note 
that a smaller drift and an improved stability may be achieved by resorting to higher-order 
or implicit integration schemes rather than decreasing the timestep; a semi-implicit scheme 
has been suggested if lubrication forces are included |35|. The stability issues, however, are 
likely to persist if merely single floating-point precision is used throughout. 

Another sensitive criterion for numerical stability is that particles initially positioned in 
the jcz-plane (particles 1 and 3) should for symmetry reasons remain in this plane during 
the simulation. Thus, a non-zero y-coordinate quantifies the numerical error, the results are 
shown in Fig. [5] using 6t - O.OOItj. Note the much shorter simulation runs compared to 
Fig.|4] the criterion here is much more sensitive to detect numerical inaccuracies. If double- 
single precision is used for the accumulation of hydrodynamic interactions, we find ^2(0 = 0, 
independently of the precision of the integration. Slowly growing deviations are observed if 
only the hydrodynamic interactions are accumulated in single precision, and large errors arise 
quickly for positions integrated in single precision as well. 

In conclusion, the floating-point precision in the accumulation of hydrodynamic interac- 
tions impacts the long-time stability of the described simulation for only four particles, and 
it is likely to become more relevant for larger setups with a large distribution of particle sep- 
arations. For the tests discussed, the precision of position integration is less crucial as long as 
the hydrodynamic interactions are accumulated with increased floating-point precision, but 
we consider it good practice to do so for the integration as well. 
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Fig. 5. Deviation of particle 1 from the j:z-plane quantifying the numerical error. Accumulation with 
different floating-point precision is compared like in Fig.|4] 

5 Conclusion 

We have described a GPU accelerated implementation of Stokesian dynamics simulations, 
a scheme for dynamic simulations of a particulate suspension that includes direct, instanta- 
neous hydrodynamic interactions and avoids the explicit simulation of the solvent. Our treat- 
ment is restricted to the leading order of a multipole expansion, thus neglecting rotational 
motion, but can in principle be extended to higher orders. The implementation is available in 
HAL's MD package |33 | from version 0.2, an efficient simulation software specifically de- 
signed to run entirely on the GPU, thereby avoiding the costly data transfer between host and 
GPU memory; it further supports selected computations with increased floating-point pre- 
cision for an optimal balance between numerical stability and runtime performance. Since 
access to global memory on the GPU often limits the execution speed of compute kernels, 
we have put particular emphasis on the memory access patterns. Explicit data caching by 
means of the shared memory was compared to the hardware cache available in the latest 
generation of Nvidia GPUs. Dedicated algorithms using shared memory are in general more 
efficient for non-linear memory access. For linear access, they show advantages for small 
data sizes, the straightforward direct memory access, however, turned out to be an equally 
fast (or even slightly faster) alternative for the computation of hydrodynamic interactions, 
where a relatively large amount of data (32 bytes) has to be transferred for each particle. 

The computation of the long-range hydrodynamic interactions requires to consider all 
paks of particles in the simulation box, yielding an algorithmic complexity of order 0{N^) 
for N particles. This scaling is reflected in the runtime for large systems, and we observed 
a speedup of about 120 compared to the serial reference implementation for a conventional 
CPU. For up to 2,000 particles however, the runtime scales linearly on the employed GPU 
hardware since only a fraction of the available computing resources is used. 

Other simulation approaches for particulate suspensions treat the solvent explicitly, which 
typically results in a linear algorithmic complexity. For fixed particle number, the prefactor 
is proportional to the solvent volume, or equally, the inverse of the colloid volume fraction. 
Rohm and Arnold |21 1 describe molecular dynamics simulations of particles suspended in a 
GPU-accelerated lattice-Boltzmann solvent. For comparison, they report that one integration 
step takes about 0.9 ms for 4,150 particles in a solvent of lattice nodes, which is to be 
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Fig. 6. Illustration of the different access patterns. Data are assumed to represent a 3 x 4 matrix (left) 
stored in row-major order (white bar on top). "Linear" data access corresponds to reading a row of the 
matrix, "strided" access reads a column, and "unordered" access reads along a diagonal depending on 
the offset. Single numbers denote the sequence of data access, and comma-separated pairs indices of 
matrix elements. 



contrasted with 5.7 ms obtained in this work. Note that the execution time for Stokesian dy- 
namics is merely a function of the absolute particle number, while for the lattice-Boltzmann 
method it crucially depends on the particle density and the resolution of the solvent lattice. 
Hence, Stokesian dynamics will in general perform better for dilute suspensions. Similarly, 
Pham et al. f36l compared both methods (including Brownian motion) for a polymer chain in 
solution using conventional hardware and found Stokesian dynamics to be an order of mag- 
nitude faster even for chain lengths of 1000 monomers, despite the asymptotically favourable 
scaling of the lattice-Boltzmann method. Apart from aspects of computational performance, 
a comparison of methods with implicit and explicit solvent may be interesting from a theo- 
retical point of view for specific problems in physics. It would permit assessing the physical 
quality of the various approximations introduced by both methods. 

The hydrodynamic interactions are related in nature to other long-range interactions, e.g., 
gravity or electrostatic interactions, and it seems natural to convey more elaborate algorithms 
from there to the hydrodynamic case, an example being Ewald summation techniques |37|. 
Another interesting approach are fast-multipole methods and tree codes with 0{N \og{N)) 
scaling, where remote particle groups are replaced by their multipole moments; Bedorf et 
al. Il38ll39l describe an implementation of a Barnes-Hut tree code for gravity fully executing 
on the GPU, which is able to process 2.8 million particles per second. 

Finally, one would like to amend the simulations by Brownian motion ifTOl [T2I. which 
involves an additional noise term. In order to satisfy the detailed balance condition, a cor- 
rect computation of the noise strength requires a Cholesky decomposition of the many-body 
mobility matrix ju, which is computationally expensive and entails a substantial memory foot- 
print. Nevertheless, such a task seems to be well suited for a future GPU implementation. 

We thank J. Dunkel, T. Franosch, and A. Louis for discussions on Stokesian dynamics simulations and 
hydrodynamic interactions and P. Colberg for correspondence on GPU-related questions. 



A Data caching for memory access patterns of complexity 0{N^) 

In this appendix, we study the efficiency of data caching using shared memory or hardware 
caches for different memory access patterns of complexity 0{N^), which occur in the com- 
putation of all-pairs interactions or in basic linear algebra problems (BLAS). The study was 
motivated by the somewhat unexpected observation that the two implementations of global 
memory access in Algorithms [T| and |2] yielded similar execution times, even on older GPU 
hardware without memory cache. Three different memory access patterns, depicted in Fig.|6] 
are addressed: linear, linear with a fixed stride, and unordered scattered. 
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We reduced the algorithms of Section [3] as much as possible to concentrate on memory 
access. A minimal computation though was included in order to prevent the compiler from 
optimising out superfluous read statements. The three minimalist CUDA kernels are given in 
Algorithms [3] and |4] also see Fig.[T] The kernels are run with threads, each thread reads 
and accumulates an array of values and stores the result. The naive kernel resembles the 
straightforward implementation, smalltiling and tiling divide the array in tiles and use shared 
memory for explicit data caching. The memory access patterns were realised by the mappings 
k k for linear access, k <r^ kp mod for strided access, and k i-» (kp + jM) mod for 
unordered access; k is the index of the memory access loop, p = 131 a prime number larger 
than 128, M = 32 the warp size, and j the global thread ID. For linear access, the threads 
of a warp request data from adjacent locations permitting coalescable memory transfer. For 
the strided access pattern, two "consecutive" threads access data separated by /:> x i bytes 
{s being the size of the basic array data type), thus two array elements can not be fetched 
by a single transaction. For the unordered access, each thread traverses the data array with a 
unique offset preventing different threads to benefit from processing the same data. The three 



Algorithm 3 Minimalist naive kernel for benchmarking memory access, 
template <typename T> 

global void naive (T* g_out, T const* g_in, unsigned N) 

{ 

unsigned const j = GTID; // 
T sum = g_in[j] ; // 
for (unsigned k = 0; k < N; ++k) 
sum += g_in[map(k)] ; 



global thread ID 
read from global memory 



g-Out[j] = sum; 



// read from global memory 
// and a trivial computation 
// write to global memory 



Algorithm 4 Minimalist small-tiling kernel. The "tiling" algorithm is obtained by substitut- 
ing warpSize with the size of the execution blocks; the condition n < warpSize in line 10 
then always evaluates to true and can be dropped, 
template <typename T> 

global void smalltiling (T* g_out, T const* g_in, unsigned N) 

{ 

unsigned const j = GTID; // global thread ID 

unsigned const n = TID; // thread ID within block 

extern shared T s_data[]; // declare shared memory 

T sum = g_in[j]; // read from global memory 

for (unsigned m=8; m<N; m+= warpSize) { 
if (n < warpSize) 

s_data[n] = g_in[map(m+n)] ; // copy from global to 
syncthreadsO ; // shared memory 

for (unsigned k = 0; k < warpSize; ++k) 

sum += s_data[k]; // a trivial computation 
syncthreadsO ; 

} 

g_out[j] = sum; // write to global memory 
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Fig. 7. Runtime performance of different 0(A'^)-memory access patterns using various GPU imple- 
mentations, see Algorithms[3]and[4] The figures show kernel execution times divided by the number of 
pairs, A'^, in the scaling limit of a large number of array elements, A'. The two panels compare arrays 
of data type float (left) and float4 (right). Light (left groups) and solid (right groups) fillings distinguish 
results for GPUs without and with hardware cache, respectively; see text for details. Some runtimes of 
the naive implementation exceed the axis range and are printed at the top. Note that the total amount of 
transferred memory is larger by a factor of 4 for float4 values. 



patterns are reminiscent of traversing a p x p matrix stored in row-major order in different 
directions; reading a row (linear), a column (strided), or a diagonal (unordered), see Fig. [6] 

Runtime measurements were conducted on two systems with GPU hardware with and 
without on-chip memory cache. The first system hosted an Nvidia Tesla S1070 server (TIO 
chip, CUDA compute capability 1.3), and the second system contained four Tesla C2050 
("Fermi") cards (compute capability 2.0). The latter GPU is, among other improvements, 
equipped with a hardware cache of 48 kB. For both systems, we used the same kernel builds 
created by the CUDA SDK version 3.2 with the compilation option -arch sm_12. Kernels 
were run with 128 threads per block on a single GPU; kernel execution times were averaged 
over 100 repetitions and do not include data transfer from host to GPU memory. The array 
size was varied between N = 1 ,000 and 70,000 checking that the runtime divided by A'^ has 
converged, and the obtained runtimes are presented in Fig. [7] 

The runtime of the naive implementation depends sensitively on the access pattern, which 
nicely illustrates the effect of the hardware cache. In the absence of a cache and for arrays of 
data type float, the kernel with unordered access is slower by a factor of 13 compared to the 
linear pattern; this reduces to 9 on the C2050 card. The factor, however, increases from 15 
(S1070) to 23 (C2050) if the basic data type is float4 (16 bytes per aiTay element). Keeping 
the access pattern and the data type fixed, the largest observed runtime improvement was by 
a factor of 2.3 for unordered access and float data, which we mainly attribute to the presence 
of memory cache and the higher number of multiprocessors on the C2050 GPU. The overall 
observation of a higher performance gain using the latter hardware for float data is likely 
due to the fact that the cache can store simply more float than float4 data items, reducing the 
number of cache misses upon later access of the same data by another thread. 

The two tiling algorithms are between 1.3 (linear access) to 20 times (unordered access) 
faster than their respective naive implementations. Both algorithms exhibit similar execution 
times, they are even comparable for the different access patterns, solely the linear access is 
somewhat faster. This shows that manual caching via shared memory can efficiently hide 
even complex memory access. Comparing the two GPU generations, the C2050 is found 
to be slightly more efficient for float4 data, but to perform similarly as one of the SI 070 
GPUs for float. A notable exception is linear access to float data, where the "small-tiling" 
implementation is about 2-3 times faster than the original tiling algorithm 1(311 [32l . The 
advantages of the "small-tiling" version are expected to become relevant only for larger array 
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elements (e.g., 2 x float4), when the considerations on shared memory and multiprocessor 
occupancy given in Section[3]apply. 

Summarising, the built-in memory cache of the Tesla C2050 hardware yields signifi- 
cant improvements to global memory access with the straightforward, "naive" implemen- 
tation of our test kernels. In particular, this implementation performs well for linear access 
of float4 data in the presence of a hardware cache. It can, however, not compete with an 
explicit caching algorithm via shared memory, which can be designed to reflect the actual 
data processing scheme for optimal cache efficiency and coalescable memory transfer The 
"small-tiling" algorithm achieves in all test cases a high memory transfer bandwidth and 
seems to be rather robust with respect to access patterns and sizes of the array elements. It 
may be worth to investigate whether it can speed up matrix-vector multiplications, where the 
data of the vector are linearly read by all threads. 
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